unzip: the inverse of zip #33515
Conversation
It seems better to return a tuple. Maybe something like […]

Of course the downside to that is it uses more temporary space. We could optimize it later by hooking in to the […]
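A minimal eager sketch of the tuple-returning idea (hypothetical code, not this PR's implementation; it assumes a non-empty iterator of equal-length tuples and does no element-type widening):

```julia
# Hypothetical sketch of an eager, tuple-returning unzip.
# Assumes `itr` is non-empty and every element has the same length;
# errors if a later element's type doesn't convert (no widening).
function unzip_simple(itr)
    first_el, rest = Iterators.peel(itr)
    vecs = map(x -> [x], Tuple(first_el))  # one vector per slot
    for el in rest
        foreach((v, x) -> push!(v, x), vecs, el)
    end
    return vecs
end
```

The "more temporary space" concern is visible here: each slot gets its own growing vector rather than sharing one backing store.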
I've really enjoyed using https://github.com/piever/StructArrays.jl/ and seeing it mature, FWIW.

Definitely. If we wanted to add things to Base, StructArrays would be on my short list. It's arguably the most useful general "shape" of container that's not really provided by Base.
It was simple to change this implementation to return a tuple: just convert it at the end. Since one expects the number of collections returned to be small, this should be quite cheap.

I think this is about the most straightforward eager […]. The potential alternative that makes sense to me would be to have […]
One possible variation would be to call […]:

```julia
julia> unzip([[1, "apple"], [2.5, "orange"], [0, "mango"]])
([1.0, 2.5, 0.0], ["apple", "orange", "mango"])
```
How would the performance of this compare to my version?

I haven't compared the performance, but if the goal is to get a tuple of unzipped vectors, this ought to be about as fast as possible since it doesn't do any extraneous work. Do you have a test case that you've been doing benchmarking on?
The performance issues I'd expect here would be the lack of type info in […]

True. I could rewrite it so that it collects into a tuple and replaces the whole tuple if the eltype of any of the slots needs to change, with the usual recursive trick.
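The "usual recursive trick" is typically written something like this (a hypothetical sketch, not the PR's actual code): an inner loop that is type-stable in the current vector types and restarts itself through a function barrier whenever an element requires widening.

```julia
# Hypothetical sketch of the widening-recursion trick. The inner loop is
# type-stable in `vecs`; when an element needs a wider type, we copy into
# widened vectors and recurse, so the compiler specializes a fresh loop.
function _unzip_into!(vecs, itr, state)
    y = iterate(itr, state)
    while y !== nothing
        el, state = y
        if all(((v, x),) -> promote_type(eltype(v), typeof(x)) === eltype(v), zip(vecs, el))
            foreach((v, x) -> push!(v, x), vecs, el)
        else
            newvecs = map(vecs, Tuple(el)) do v, x
                T = promote_type(eltype(v), typeof(x))
                newv = Vector{T}(v)   # copy with widened element type
                push!(newv, x)
            end
            return _unzip_into!(newvecs, itr, state)  # function barrier
        end
        y = iterate(itr, state)
    end
    return vecs
end

function unzip_widen(itr)
    el, state = iterate(itr)          # assumes a non-empty iterator
    vecs = map(x -> [x], Tuple(el))
    return _unzip_into!(vecs, itr, state)
end
```

Recursion depth is bounded by the number of distinct widenings, not the input length, since we only recurse when a slot's element type strictly grows.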
Here's a version that does the widening recursion trick. This should be fast in the case where […]

You're doing that anyways, because of the call to […]

My main thought was to avoid any dynamic dispatch on each loop iteration and only do it once before the loop. I also figured that if the compiler knows about the length and type of each slot in […]

Neither not using […]
Using the following test data:

```julia
itrs1 = collect(enumerate(eachline("/usr/share/dict/words")))
itrs2 = map(t->[t...], itrs1)
itrs3 = collect(enumerate(rand(1:100, 10^6)))
itrs4 = map(t->[t...], itrs3)
```

Benchmark results on this branch:

```julia
julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  10.78 MiB
  allocs estimate:  470760
  --------------
  minimum time:     14.655 ms (0.00% GC)
  median time:      16.224 ms (0.00% GC)
  mean time:        16.807 ms (3.20% GC)
  maximum time:     23.046 ms (29.50% GC)
  --------------
  samples:          298
  evals/sample:     1

julia> @benchmark unzip($itrs2)
BenchmarkTools.Trial:
  memory estimate:  147.54 MiB
  allocs estimate:  4715666
  --------------
  minimum time:     522.516 ms (1.65% GC)
  median time:      546.774 ms (3.06% GC)
  mean time:        563.799 ms (2.50% GC)
  maximum time:     691.221 ms (1.36% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  61.03 MiB
  allocs estimate:  1999495
  --------------
  minimum time:     29.845 ms (0.00% GC)
  median time:      38.098 ms (16.04% GC)
  mean time:        36.911 ms (12.67% GC)
  maximum time:     44.987 ms (20.47% GC)
  --------------
  samples:          136
  evals/sample:     1

julia> @benchmark unzip($itrs4)
BenchmarkTools.Trial:
  memory estimate:  564.54 MiB
  allocs estimate:  17997947
  --------------
  minimum time:     1.594 s (3.17% GC)
  median time:      1.604 s (3.55% GC)
  mean time:        1.616 s (3.69% GC)
  maximum time:     1.661 s (4.44% GC)
  --------------
  samples:          4
  evals/sample:     1
```

Benchmark results on #33324:

```julia
julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  3.60 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.094 ms (0.00% GC)
  median time:      2.678 ms (0.00% GC)
  mean time:        2.573 ms (11.13% GC)
  maximum time:     12.254 ms (70.55% GC)
  --------------
  samples:          1938
  evals/sample:     1

julia> @benchmark unzip($itrs2)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(235886)]

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  15.26 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.454 ms (0.00% GC)
  median time:      2.490 ms (0.00% GC)
  mean time:        3.486 ms (28.96% GC)
  maximum time:     12.152 ms (56.13% GC)
  --------------
  samples:          1432
  evals/sample:     1

julia> @benchmark unzip($itrs4)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(1000000)]
```

So @bramtayl's version does have really impressive performance in the inferrable case, but it fails in the non-inferrable case. I guess that means that this code can stand some optimization.
Basically, it comes down to being tricky to convince the compiler to unroll the inner unzip loop even though it knows the length and type of the value tuple and the tuple of vectors to write into.
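One standard way to force that unrolling (again, a hypothetical sketch rather than this PR's code) is to recurse on the tuple structure itself instead of looping over indices, so each slot's `push!` becomes a separate, statically-typed call:

```julia
# Hypothetical sketch: recursing on tuple structure instead of looping
# lets the compiler emit one statically-typed push! per slot.
push_slots!(vecs::Tuple{}, el::Tuple{}) = nothing
function push_slots!(vecs::Tuple, el::Tuple)
    push!(first(vecs), first(el))
    push_slots!(Base.tail(vecs), Base.tail(el))
end
```

Because both arguments are tuples of known length and element types, each recursive level is a distinct method specialization with no dynamic dispatch in the loop body.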
I'm going to give this a friendly bump. It would be nice to have in 1.7.
If we put this in Base, I do think it should be inferable and as fast as possible, especially since some good implementations (for particular cases) are already available in packages. If I'm reading it right, the big issue with the implementation in #33515 (comment) is that it does multiple passes over the iterator, and so is not general enough. I will try to experiment with the StructArrays kind of approach I was talking about above.

https://github.com/bramtayl/Unzip.jl exists and is registered (I think it's pretty performant, but I'm sure there's ways to speed it up more)

Yes, that's the approach I'm talking about. Unzip.jl seems quite good --- fast, infers, and no generated functions. StructArrays.jl does the same thing as well, but it is astonishing how different the code is. It uses some generated functions, and also re-implements some of the Base […]
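For reference, the StructArrays route can be sketched roughly like this (assuming the current StructArrays.jl API; `components` was called `fieldarrays` in older versions, and the exact entry point may differ):

```julia
using StructArrays  # external package, not part of Base

# Hypothetical sketch: collect the iterator column-wise into a StructArray,
# then return its per-slot backing arrays.
unzip_sa(itr) = StructArrays.components(StructArrays.collect_structarray(itr))
```

The heavy lifting (single pass, element-type widening) lives inside the package's collection machinery rather than in this one-liner.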
Well, […]

I am a proponent of allowing Tuples and NamedTuples interchangeably.
I noted this in an issue at Unzip, but I should note that my main use-case is when I do […]

i.e. I use […]

Agreed it might be nice for it to work with tuples or namedtuples. What would be even cooler would be to merge namedtuples and tuples into the same thing, where you could refer to items by name if they have it, or else just by number. Something like […]
That design would certainly have its own trade-offs; I don't think it's universally better.

Out of curiosity, what would be the downsides?

While it may make sense to have […]

Allowing unnamed entries within NamedTuples seems a confusing tack to me.

Hmm, well, I guess it's a mostly unrelated issue. But if we want an unzip that works with NamedTuples and Tuples, it would make it a lot easier if Tuples and NamedTuples were... less different?
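For what it's worth, the overlap already exists in one direction: NamedTuples support positional indexing and convert losslessly to plain Tuples; it's only the reverse (unnamed entries inside a NamedTuple) that has no representation today.

```julia
nt = (a = 1, b = 2)
@assert nt[1] == 1           # positional indexing already works on NamedTuples
@assert nt[:b] == 2          # ...alongside indexing by name
@assert Tuple(nt) == (1, 2)  # lossless conversion to a plain Tuple
```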
Implementing a really basic […]
bump! can we get tests and news so this can be closed?
I added some tests, NEWS, and a compat annotation, and fixed a bug due to a missing import. There is currently a test failure for the case of an empty iterator of iterators, where the behavior of returning […]:

```julia
julia> itrs = ((), ())
((), ())

julia> collect.(itrs)
(Union{}[], Union{}[])

julia> unzip(zip(itrs...))
()
```

Note that this also makes […]:

```julia
julia> unzip(Tuple{Int8,Bool}[])
()

julia> unzip(Tuple{Int8,Bool}[(3,true)])
(Int8[3], Bool[1])
```

I know that the empty-iterator case is notoriously tricky, but can we do something better at least in the case where the […]
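When the eltype is a concrete Tuple type, the number and element types of the outputs are statically known even for an empty input, so something along these lines seems doable (a hypothetical sketch of the empty-input branch, not the PR's code):

```julia
# Hypothetical sketch: for an empty input whose eltype is a concrete
# Tuple type, return the right number of appropriately-typed empty vectors.
function unzip_empty(itr)
    T = eltype(itr)
    if T <: Tuple && isconcretetype(T)
        return ntuple(i -> fieldtype(T, i)[], fieldcount(T))
    end
    return ()  # no type information: fall back to the ambiguous empty tuple
end
```

With this branch, `unzip(Tuple{Int8,Bool}[])` could return `(Int8[], Bool[])` instead of `()`, matching what `collect` of each slot would give.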
Closes #13942. Alternative to #33324, with apologies to @bramtayl, who did a lot of work on unzip.